Abstract

Statistical models of 3D human shape and pose learned from scan databases have 
developed into valuable tools to solve a variety of vision and graphics problems. 
Unfortunately, most publicly available models are of limited expressiveness as 
they were learned on very small databases that hardly reflect the true variety 
in human body shapes. In this paper, we contribute by rebuilding a widely 
used statistical body representation from the largest commercially available scan 
database, and making the resulting model available to the community (visit
http://humanshape.mpi-inf.mpg.de). As preprocessing several thousand
scans for learning the model is a challenge in itself, we contribute by developing 
robust best practice solutions for scan alignment that quantitatively lead to the 
best learned models. We make implementations of these preprocessing steps 
also publicly available. We extensively evaluate the improved accuracy and 
generality of our new model, and show its improved performance for human 
body reconstruction from sparse input data. 
1. Introduction


Statistical human shape models represent variations in human physique and 
pose using low-dimensional parameter spaces, and are valuable tools to solve 
difficult vision and graphics problems, e.g. in pose tracking or animation.
Despite significant progress in modeling the statistics of the complete 3D human
shape and pose, only few publicly available statistical 3D body shape spaces
exist. Furthermore, the public models are often learned on only small datasets
with limited shape variations [16]. The reason is
a lack of large representative public datasets and the significant effort required 
to process and align raw laser scans prior to learning a statistical shape space. 

This paper contributes by systematically constructing a model of 3D human 
shape and pose from the largest commercially available dataset of 3D laser 
scans [26] and making it publicly available to the research community
(Section 2). Our model is based on a simplified and efficient variant of the
SCAPE model [3] (henceforth termed S-SCAPE space) that was described by Jain et
al. [18] and used for different applications in computer vision and graphics
[18, 23, 20], but was never learned from such a complete dataset. This
compact shape space learns a probability distribution from a dataset of 3D 
human laser scans. It models variations due to changes in identity using a 
principal component analysis (PCA) space, and variations due to pose using a 
skeleton-based surface skinning approach. This representation makes the model 
versatile and computationally efficient. 

Prior to statistical analysis, the human scans have to be processed and 
aligned to establish correspondence. We contribute by evaluating different
variants of state-of-the-art techniques for non-rigid template fitting and
posture normalization to process the raw data. Our findings are not
entirely new methods, but best practices and specific solutions for automatic
preprocessing of large scan databases for learning the S-SCAPE model
(Section 3). First, fitting the shape and posture of an initial shape model
to a raw scan prior to non-rigid deformation considerably improves the results.
Second, multiple passes over the dataset improve initialization and thus increase
the overall fitting accuracy and statistical model quality. Third, posture
normalization prior to shape space learning leads to much better generalization and
specificity. 

The main contribution of our work is a set of S-SCAPE spaces learned 
from the largest database that is currently commercially available [26]. The
differences between our S-SCAPE spaces stem from differences in the registration
and pre-alignment of the human body scans. We evaluate different data processing
techniques in Section 4 and the resulting shape spaces in Section 5.
Finally, in Section 6 we compare our S-SCAPE spaces to the state-of-the-art
S-SCAPE space learned from a publicly available database [16] for the application
of reconstructing full 3D body models from partial depth data. Experimental
evaluation clearly demonstrates the advantages of our more expressive shape
models in terms of shape space quality and performance on the task of
reconstructing 3D human body shapes from partial depth observations.

We release the new shape spaces together with code to (1) pre-process raw scans
and (2) fit a shape space to a raw scan. We believe this contribution
is required for future development in human body modeling. Visit
http://humanshape.mpi-inf.mpg.de to download code and models.

1.1. Related work 

Datasets. Several datasets have been collected to analyze populations of 3D 
human bodies. Many publicly available research datasets allow for the analysis 
of shape and posture variations jointly; unfortunately, they feature data of at
most 100 individuals, which limits the range of shape variations. We
therefore use the CAESAR database [26], the largest commercially available
dataset to date, which contains 3D scans of over 4500 American and European
subjects in a standard pose and thus represents a much richer sample of the
human physique.

Statistical shape spaces of 3D human bodies. Building statistical shape 
spaces of human bodies is challenging, as there is strong and intertwined 3D 
shape and posture variability yielding a complex function of multiple correlated
shape and posture parameters. Methods to learn this shape space usually follow
one of two routes. The first group of methods learns shape- and posture-related
deformations separately and combines them afterwards [3]. These methods are
inspired by the SCAPE model [3], which couples a shape space learned from
variations in body shape with a posture space learned from deformations of a
single subject. This method has recently been enhanced to capture deformations
related to breathing [33] and dynamic motions. Most
SCAPE-like models use a set of transformations per triangle to encode shape
variations in a shape space. Hence, to convert between the vertex coordinates
of a processed scan and its representation in shape space, a computationally
demanding optimization problem needs to be solved. To overcome this difficulty,
a simplified version of the SCAPE model (S-SCAPE) was proposed [18].
S-SCAPE operates on vertex coordinates directly and models pose variation
with an efficient skeleton-based surface skinning approach. Recently, two
alternative multi-linear shape spaces have been proposed that also
operate directly on vertex coordinates [21, 19].

Another group of methods performs simultaneous analysis of shape
and posture variations. These methods learn skinning weights for corrective
enveloping of posture-related shape variations, which makes it possible to
explore both shape and posture variations using a single shape space.
Furthermore, it allows for realistic muscle bulging as shape and posture are
correlated [22]. It has been shown, however, that for many applications in
computer vision and graphics this level of detail is not required and simpler
and computationally more efficient shape spaces can be used [24, 23].

Mesh registration. Mesh registration is performed on the scans to bring them 
in correspondence for statistical analysis. Two surveys [34, 32] review such
techniques, and a full review is beyond the scope of this paper. Allen et al. [1]
use non-rigid template fitting to compute correspondences between human body
shapes in similar posture. This technique has been extended to work for varying
postures [3, 2, 16] and in scenarios where no landmarks are available [37]. In
this work, we evaluate a non-rigid template fitting approach inspired by Allen
et al. [1].

Applications. Statistical spaces of human body shape and posture are applicable
in many areas including computer vision, computer graphics, and ergonomic
design; our new model, learned on a large commercially available dataset,
is beneficial in each of these applications. Statistical shape spaces have been
used to predict body shapes from partial data, such as image sequences and depth
images, and from semantic parameters.
Furthermore, they have been used to estimate body shapes from images [5] and
3D scans of dressed subjects. Given a 3D body shape, statistical shape
spaces can be used to modify input images or videos [18], to automatically
generate training sets for people detection, or to simulate clothing on
people [12].

2. Statistical modeling with SCAPE 

We briefly recap the efficient version of the SCAPE model [18] we build on
and discuss how it differs from the original SCAPE model [3]. For
learning the model, both methods assume that a template mesh $T$ containing
$N$ vertices has been deformed to each raw scan in a database. All scans of the
database are assumed to be rigidly aligned, e.g. by Procrustes analysis.

2.1. Original SCAPE model 

In the original SCAPE model, the transformation of each triangle of $T$ is
modeled as a combination of three linear transformations: $R_{m,i} \in SO(3)$ and
$Q_{m,i} \in \mathbb{R}^{3 \times 3}$ controlling posture, and
$C_{m,i} \in \mathbb{R}^{3 \times 3}$ controlling body shape. Index $i$
indicates the particular scan that $T$ is fitted to. The fitting result after
rigid alignment with $T$ is denoted as instance mesh $M_i$.

Shape deformations $C_{m,i}$ encode per-triangle deformations that can be
applied to change the body shape of the person in the same standard posture.
A low-dimensional space of plausible shape deformations $C_{m,i}$ is computed by
performing PCA on the training dataset captured in standard posture.

To represent posture changes, two transformations are used: $R_{m,i}$
represents the posture of the person as a rotation induced by the deformation
of an underlying rigid skeleton, and $Q_{m,i}$ encodes the individual
deformations of each triangle that originate from varying body shape or
non-rigid posture-dependent surface deformations such as muscle bulging.
Computing $Q_{m,i}$ for each triangle separately is an under-constrained
problem. Therefore, smoothing is applied such that the $Q_{m,i}$ of neighboring
triangles become dependent. Finally, the dimensionality of the transformations
$R_{m,i}$ and $Q_{m,i}$ is reduced with the help of a kinematic chain model.

In this way, SCAPE obtains a flexible model that covers a wide range of
possible shape and posture deformations. However, as the model does not
explicitly encode vertex positions, a computationally expensive optimization
problem
needs to be solved in order to reconstruct the mesh surface. 

2.2. Simplified SCAPE (S-SCAPE) space 

The aforementioned computational overhead is often prohibitive in applications
where speed is more important than the overall reconstruction quality, or
when many samples need to be drawn from the shape space. The S-SCAPE space [18]
reconstructs vertex positions in a given posture and shape without the need to
solve a Poisson system. To learn the model, only laser scans in a standard
posture $\chi_0$ are used. Meshes $M_i$ are used to learn a PCA model that
represents each shape using a parameter vector $\varphi \in \mathbb{R}^{D}$ and
can generate new models (represented in homogeneous coordinates) with body
shape $\varphi$ in posture $\chi_0$ as

$$M_{\varphi,\chi_0} = C\varphi + \overline{M}, \quad (1)$$

where $C \in \mathbb{R}^{4N \times D}$, with $D$ the dimension of the PCA space,
is the matrix computed using PCA and $\overline{M}$ is the mean body shape of
the training database.
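
As a minimal sketch of Eq. 1, the PCA basis and the mean shape can be obtained with a singular value decomposition. The sketch assumes registered meshes stacked as flattened vertex coordinates, uses plain 3D coordinates rather than homogeneous ones, and runs on random toy data instead of scans. The function names `learn_shape_space`, `encode` and `decode` are illustrative and are not taken from the released code.

```python
import numpy as np

def learn_shape_space(meshes, num_components):
    """Learn a PCA shape space from registered meshes.

    meshes: array of shape (num_scans, N, 3), all in the standard posture
            and in vertex-wise correspondence.
    Returns the mean shape (3N,) and the basis C of shape (3N, D).
    """
    data = meshes.reshape(len(meshes), -1)
    mean = data.mean(axis=0)
    # Rows of vt are the principal directions of the centered data.
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
    return mean, vt[:num_components].T

def encode(mesh, mean, C):
    """Project a mesh onto the shape space to obtain the parameters phi."""
    return C.T @ (mesh.reshape(-1) - mean)

def decode(phi, mean, C):
    """Eq. 1: reconstruct vertex coordinates as C @ phi + mean."""
    return (C @ phi + mean).reshape(-1, 3)

rng = np.random.default_rng(0)
toy_meshes = rng.normal(size=(20, 50, 3))  # 20 toy "scans" with 50 vertices
mean, C = learn_shape_space(toy_meshes, num_components=5)
phi = encode(toy_meshes[0], mean, C)
approx = decode(phi, mean, C)
```

Because the columns of `C` are orthonormal, encoding reduces to a matrix product; no optimization is needed to move between vertex coordinates and shape parameters.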

This shape space only covers variations in body shape but not in posture. To
enable the latter, an articulated skeleton is fitted to the average human shape
and linear blend skinning weights are used to attach the surface to the bones.
This allows deforming a body with fixed shape $\varphi_0$ into an arbitrary
posture $\chi$ as

$$p_i(M_{\varphi_0,\chi}) = \sum_{j=1}^{B} w_{i,j} R_j(\chi)\, p_i(M_{\varphi_0,\chi_0}), \quad (2)$$

where $p_i(M_{\varphi_0,\chi})$ is the homogeneous coordinate of the $i$-th
vertex of $M_{\varphi_0,\chi}$, $B$ is the number of bones used for the rigging,
$R_j \in \mathbb{R}^{4 \times 4}$ is the transformation of the $j$-th bone, and
$w_{i,j}$ are the rigging weights. We use the rigging and skeleton
consisting of $B = 15$ bones proposed by Jain et al. [18]. The skeleton is
controlled by 30 pose parameters corresponding to rigid transformations, joint
angles and scale.
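
Eq. 2 can be sketched in a few lines, assuming the bone transformations for the target posture are already given as 4 x 4 matrices. The toy rig below (two bones, three vertices) and the helper `translation` are made up for illustration; the actual rig has 15 bones.

```python
import numpy as np

def linear_blend_skinning(vertices_h, weights, bone_transforms):
    """Apply Eq. 2 to every vertex.

    vertices_h:      (N, 4) homogeneous vertex positions in the standard posture.
    weights:         (N, B) rigging weights, each row summing to one.
    bone_transforms: (B, 4, 4) per-bone transformations for the target posture.
    """
    # Blend the bone transformations per vertex, giving an (N, 4, 4) array.
    blended = np.einsum("ij,jkl->ikl", weights, bone_transforms)
    # Apply each blended transformation to its own vertex.
    return np.einsum("ikl,il->ik", blended, vertices_h)

def translation(t):
    """Homogeneous 4 x 4 matrix for a pure translation (toy bone motion)."""
    m = np.eye(4)
    m[:3, 3] = t
    return m

# Toy rig with B = 2 bones and N = 3 vertices along the x axis.
vertices_h = np.array([[0.0, 0.0, 0.0, 1.0],
                       [1.0, 0.0, 0.0, 1.0],
                       [2.0, 0.0, 0.0, 1.0]])
weights = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
bones = np.stack([np.eye(4), translation([0.0, 2.0, 0.0])])
posed = linear_blend_skinning(vertices_h, weights, bones)
```

The middle vertex is attached to both bones with equal weight and therefore moves by half of the second bone's translation.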

For reconstructing a model of shape $\varphi$ in skeleton posture $\chi$, the
method first calculates a personalized mesh $M_{\varphi,\chi_0}$ using
$\varphi$, and subsequently applies linear blend skinning to the personalized
mesh to obtain the final mesh $M_{\varphi,\chi}$. This can be expressed in
matrix notation as

$$M_{\varphi,\chi} = R(\chi) C \varphi + R(\chi) \overline{M}, \quad (3)$$

where $R(\chi) \in \mathbb{R}^{4N \times 4N}$ is a block-diagonal matrix
containing the per-vertex transformations. While the decoupling of shape and
posture modeling in S-SCAPE results in a lower level of detail (e.g.
posture-specific deformations such as muscle bulging may be missing), it leads
to much faster reconstruction, especially when the personalized mesh and
skeleton can be precomputed. We argue that in many applications speed may be
much more important than the overall reconstruction quality, and we build on
this simple and efficient shape space in this work.
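
The step from Eqs. 1 and 2 to Eq. 3 can be spelled out as follows. The derivation assumes, which the text does not state explicitly, that the $i$-th diagonal block of $R(\chi)$ is the blended transformation of vertex $i$; the symbol $T_i(\chi)$ is introduced here only for illustration, and $\overline{M}$ denotes the mean body shape.

```latex
\begin{align*}
R(\chi) &= \operatorname{diag}\bigl(T_1(\chi), \dots, T_N(\chi)\bigr),
  \qquad T_i(\chi) = \sum_{j=1}^{B} w_{i,j} R_j(\chi) \in \mathbb{R}^{4 \times 4}, \\
M_{\varphi,\chi} &= R(\chi)\, M_{\varphi,\chi_0}
  = R(\chi)\,\bigl(C\varphi + \overline{M}\bigr)
  = R(\chi)\, C\varphi + R(\chi)\,\overline{M}.
\end{align*}
```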

3. Data processing 

This section describes our pre-processing procedure for establishing
correspondences between raw laser scans. We demonstrate best-practice solutions
for non-rigid template fitting and effective initialization strategies,
introduce a novel human-in-the-loop bootstrapping approach that improves the
correspondences, and finally explore posture normalization strategies. Tools to
reproduce these steps are made publicly available.

3.1. Non-rigid template fitting 

Our method to fit a human shape template $T$ to a human scan $S$ is inspired
by Allen et al. [1]. In non-rigid template fitting (henceforth abbreviated NRD),
each vertex $p_i$ of $T$ is transformed by a $4 \times 4$ affine matrix $A_i$,
which allows for twelve degrees of freedom during the transformation. The aim is
to find a set of matrices $A_i$ that align the vertices of the deformed template
$M$ to the corresponding points of $S$ in the best possible way. The fitting is
done by minimizing a combination of data, smoothness and landmark errors.

Data term. The data term requires each vertex of the transformed template
to be as close as possible to its corresponding vertex of $S$, and takes the form

$$E_d = \sum_{i=1}^{N} w_i \left\| A_i p_i - NN_i(S) \right\|_F^2, \quad (4)$$

where $w_i$ weights the error contribution of each vertex, $\|\cdot\|_F$ denotes
the Frobenius norm, and $NN_i(S)$ is the closest compatible point in $S$. If the
surface normals of the closest points are less than 60° apart and the distance
between the points is less than 20 mm, we set $w_i$ to 1, otherwise to 0.
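
The compatibility test behind the weights can be sketched as follows. This simplified version takes the nearest scan point by brute force and then checks the two thresholds, which is only one possible reading of "closest compatible point"; it also uses squared distances, and the toy coordinates are invented.

```python
import numpy as np

def data_term(template_pts, template_normals, scan_pts, scan_normals,
              max_angle_deg=60.0, max_dist=20.0):
    """Evaluate a squared-distance data term for already transformed vertices.

    template_pts: (N, 3) transformed template vertices, in mm.
    scan_pts:     (K, 3) scan points, in mm. Normals are unit vectors.
    Returns the energy and the binary per-vertex weights.
    """
    # Brute-force nearest neighbor in the scan for every template vertex.
    diffs = template_pts[:, None, :] - scan_pts[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    nearest = dists.argmin(axis=1)
    nearest_dist = dists[np.arange(len(template_pts)), nearest]

    # Normals less than 60 degrees apart and distance below 20 mm.
    cos_angle = np.sum(template_normals * scan_normals[nearest], axis=1)
    close_in_angle = cos_angle > np.cos(np.radians(max_angle_deg))
    weights = (close_in_angle & (nearest_dist < max_dist)).astype(float)
    return float(np.sum(weights * nearest_dist ** 2)), weights

template_pts = np.array([[0.0, 0.0, 0.0], [100.0, 0.0, 0.0], [0.0, 50.0, 0.0]])
template_normals = np.tile([0.0, 0.0, 1.0], (3, 1))
scan_pts = np.array([[0.0, 0.0, 3.0], [100.0, 0.0, 30.0], [0.0, 50.0, 4.0]])
scan_normals = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0], [0.0, 0.0, -1.0]])
energy, weights = data_term(template_pts, template_normals, scan_pts, scan_normals)
```

In the toy data only the first vertex is compatible: the second match is 30 mm away and the third has a flipped normal. For meshes of realistic size, a spatial index such as a k-d tree would replace the brute-force search.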

Smoothness term. Fitting using $E_d$ only may lead to situations where
neighboring vertices of $T$ match to disparate vertices in $S$. To enforce
smooth surface deformations we use a smoothness term $E_s$ that requires the
affine transformations applied to connected vertices to be similar, i.e.

$$E_s = \sum_{\{i,j \,\mid\, (p_i,p_j) \in \mathrm{edges}(T)\}} \left\| A_i - A_j \right\|_F^2. \quad (5)$$
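
A minimal sketch of the smoothness term, assuming squared Frobenius norms and an edge list that contains each template edge once. The three-vertex path is a toy example.

```python
import numpy as np

def smoothness_term(transforms, edges):
    """Sum of squared Frobenius norms of A_i - A_j over the template edges.

    transforms: (N, 4, 4) per-vertex affine matrices.
    edges:      sequence of (i, j) vertex index pairs, one entry per edge.
    """
    edges = np.asarray(edges)
    diff = transforms[edges[:, 0]] - transforms[edges[:, 1]]
    return float(np.sum(diff ** 2))

# Three vertices on a path 0 - 1 - 2; only vertex 2 is translated.
A = np.stack([np.eye(4), np.eye(4), np.eye(4)])
A[2, :3, 3] = [1.0, 2.0, 2.0]
energy = smoothness_term(A, [(0, 1), (1, 2)])
```

Only the edge between vertices 1 and 2 contributes, since the transformations of vertices 0 and 1 are identical.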

Landmark term. Although using Ed and E s would suffice to fit two surfaces 
that are close to each other, the optimization may get stuck in a local minimum 
when T and S are far apart. A remedy is to identify a set of points on T 
corresponding to known anthropometric landmarks on $S$. In each CAESAR
scan these are obtained by placing markers on each subject prior to scanning.
Our landmark term penalizes misalignments between landmark locations

$$E_l = \sum_{i=1}^{M} \left\| A_{k_i} p_{k_i} - l_i \right\|_F^2, \quad (6)$$

where $k_i$ is the index of the $i$-th landmark on $T$, and $l_i$ is the
corresponding landmark point on $S$. Although there are only 64 landmarks
compared to the total number of 6449 vertices, good landmark fitting is enough
to get the deformed surface of $T$ close to $S$ and avoid convergence to a poor
local minimum.

Combined energy. The three terms are combined into a single objective

$$E = \alpha E_d + \beta E_s + \gamma E_l. \quad (7)$$

For optimization we use L-BFGS-B [42]. We vary the weights $\alpha$, $\beta$ and
$\gamma$ according to the following empirically found schedule. We first perform
a single iteration of optimization without the data term by setting
$\alpha = 0$, $\beta = 10^6$, $\gamma = 10^{-3}$, which brings the surfaces
into rough correspondence. We then allow the data term to contribute by setting
$\alpha = 1$, $\beta = 10^6$, $\gamma = 10^{-3}$. In addition, we relax the
smoothness and landmark weights after each iteration of fitting to
$\beta := 0.25\beta$ and $\gamma := 0.25\gamma$, thus allowing the data term to
dominate. This is repeated until $\beta < 10^3$. Reducing $\beta$ increases the
flexibility of the deformation and allows $T$ to better reproduce fine details,
while reducing $\gamma$ is necessary due to unreliable placement of landmarks in
some scans.
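
The weight schedule can be written down as a small generator. This is a sketch of one reading of the stopping rule, in which no iteration is run once the smoothness weight has dropped below the threshold; the function name and parameter names are illustrative.

```python
def weight_schedule(beta0=1e6, gamma0=1e-3, decay=0.25, beta_min=1e3):
    """Yield (alpha, beta, gamma) for successive fitting iterations."""
    # Initial iteration without the data term.
    yield 0.0, beta0, gamma0
    beta, gamma = beta0, gamma0
    # Iterations with the data term; both weights are relaxed after each one.
    while beta >= beta_min:
        yield 1.0, beta, gamma
        beta *= decay
        gamma *= decay

schedule = list(weight_schedule())
```

Under this reading the data term is active for five iterations, with the smoothness weight going from 10^6 down to 3906.25; one further relaxation gives roughly 977, which is below 10^3 and ends the schedule.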

3.2. Initialization 

For non-rigid template fitting to succeed, T should be pre-aligned to S. We 
explore two initialization strategies. 

A first standard way to initialize NRD is to use a static template with 
annotated landmarks. Corresponding landmarks are then used to rigidly align 
S to T. 
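
Rigid alignment from corresponding landmarks is commonly solved in closed form with an SVD (the Kabsch/Procrustes solution). The text does not specify which solver is used, so the following is a sketch of that standard technique rather than the authors' implementation; the landmark coordinates are random toy data.

```python
import numpy as np

def rigid_align(source, target):
    """Least-squares rotation R and translation t with R @ source_i + t ~ target_i.

    source, target: (K, 3) arrays of corresponding landmark positions.
    """
    src_mean = source.mean(axis=0)
    tgt_mean = target.mean(axis=0)
    H = (source - src_mean).T @ (target - tgt_mean)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = tgt_mean - R @ src_mean
    return R, t

rng = np.random.default_rng(0)
scan_landmarks = rng.normal(size=(64, 3))
angle = np.radians(30.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle), np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([10.0, -5.0, 2.0])
template_landmarks = scan_landmarks @ R_true.T + t_true
R, t = rigid_align(scan_landmarks, template_landmarks)
aligned_scan = scan_landmarks @ R.T + t
```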

A second way to initialize the fitting is to start with an S-SCAPE space that
was learned from a small registered dataset. Fitting the shape space to a scan is
achieved by finding shape and posture parameters $\varphi$ and $\chi$ such that
$M_{\varphi,\chi}$ (see Eq. 3) is close to $S$. To this end, $E$ (Eq. 7) is
minimized with respect to $\varphi$ and $\chi$. To do so, we use the vertices of
$M_{\varphi,\chi}$ and set the per-vertex deformations $A_i$ to the identity.
That is, the deformation of the body shape $M_{\varphi,\chi}$ is exclusively
controlled by the parameters $\varphi$ and $\chi$. As in this case neighboring
vertices do not move independently, the term $E_s$ is not required, and we set
$\beta = 0$.

To find a good local minimum, good initialization is required. We found a
two-step optimization approach to work well in practice. First, we set
$\alpha = 0$ and $\gamma = 1$ and optimize $E$ with respect to $\chi$ while
fixing $\varphi$, which fits the posture of $M_{\varphi,\chi}$ to $S$ with the
help of landmarks. Second, we set $\alpha = 1$ and $\gamma = 0$ and optimize $E$
with respect to $\varphi$ and $\chi$ iteratively. For increased efficiency, each
iteration optimizes $E$ with respect to $\chi$ in a first step and with respect
to $\varphi$ in a second step. After each iteration, the set $NN_i(S)$ is
recomputed. This iterative procedure is repeated until $E$ does not change
significantly. An iterative interior point method is used for optimization.

3.3. Bootstrapping 

In many cases, even after non-rigid template fitting (NRD), the fitted mesh $M$
is far from the target human scan $S$. Learning from registered scans with a high
fitting error may capture unrealistic shape deformations. We thus propose the
following human-in-the-loop bootstrapping process: after each fitting
iteration we visually examine each registered scan, discard registered scans of
low quality, and learn an S-SCAPE space using the registered scans that passed
the visual inspection; the learned S-SCAPE space is then used during
initialization of the next fitting pass and the process is repeated. This
bootstrapping process
is performed for multiple iterations until nearly all registered scans pass the 
visual inspection. Note that visual inspection is required, as low average fitting 
errors do not always correspond to good results, since the fitting of localized 
areas may be inaccurate. 
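
The control flow of the bootstrapping procedure can be sketched as a loop over fitting passes. The callables `fit`, `passes_inspection` and `learn_space` are hypothetical placeholders supplied by the caller; in particular, `passes_inspection` stands in for the manual visual check and is not an automatic criterion from this paper.

```python
def bootstrap(scans, initial_space, fit, passes_inspection, learn_space,
              num_passes=5):
    """Alternate between fitting all scans and re-learning the shape space.

    fit(scan, space)        -> registered mesh
    passes_inspection(mesh) -> bool, placeholder for the visual inspection
    learn_space(meshes)     -> new shape space
    Returns the final shape space and the number of accepted scans per pass.
    """
    space = initial_space
    accepted_counts = []
    for _ in range(num_passes):
        registered = [fit(scan, space) for scan in scans]
        accepted = [mesh for mesh in registered if passes_inspection(mesh)]
        accepted_counts.append(len(accepted))
        if accepted:
            # The next pass is initialized from the newly learned space.
            space = learn_space(accepted)
    return space, accepted_counts
```

Passing the three steps in as functions keeps the loop independent of any particular fitting or learning code.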
3.4. Posture normalization

The S-SCAPE space used in this study decouples learning of shape and
posture variations and learns shape variations via PCA on the registered scans
captured in a standard posture. However, even standard postures may still
contain slight posture variations, mostly due to movements of the arms. Thus PCA
may learn global shape variations that are due to variation in posture. To
address this issue, we perform posture normalization of the registered scans
based on two approaches [38, 21], as explained in the following.

Wuhrer et al. [38] factor out variations due to posture changes by performing
PCA on localized Laplacian coordinates. While this approach leads to
better shape spaces, it is difficult to apply it directly to the S-SCAPE
spaces, which are learned using Cartesian coordinates. We therefore compute a
posture-normalized version of each fitted mesh $M_i$ in the following way: we
start with a mean shape $\overline{M}$ computed over all $M_i$ and use [38] to
optimize the localized Laplacian coordinates of $\overline{M}$ to be as close as
possible to $M_i$. This leads to fittings that have the body shape of
$M_i$ in the common normalized posture of $\overline{M}$.

Neophytou and Hilton [21] normalize the posture of each processed scan 
using a skeleton model and Laplacian surface deformation. While such
normalization may introduce artifacts around joints when the posture is changed
significantly, this approach is suitable to account for minor posture variations
of CAESAR scans. We use this method to modify the posture of each fitted mesh
$M_i$.

4. Evaluation of template fitting 

We now evaluate different components of our registration procedure on the
CAESAR dataset [26]. Each CAESAR scan contains 73 manually placed landmarks.
We exclude several landmarks located on open hands, as those are missing for
our template, resulting in 64 landmarks used for registration. Furthermore, we
remove all laser scans without landmarks and corrupted scans, resulting in 4308
scans. 


4.1. Implementation details

Non-rigid template fitting requires a human shape template as input, and 
the initialization procedure requires an initial shape space. We use registered 
scans of 111 individuals in neutral posture from the MPI Human Shape dataset [16]
to compute these initializations. 

However, MPI scans have artifacts such as spiky non-smooth surfaces in 
the areas of head and neck. We smooth these areas by identifying problematic 
vertices and by iteratively recomputing their positions as an average position 
of direct neighbors. Furthermore, due to privacy reasons, head vertices of each 
human scan were replaced by the same dummy head, which is not representative 
and of low quality at the backside. We adjust the vertex compatibility criteria 
to compute nearest neighbors during NRD by allowing 30° deviation of the head 
face normals while increasing the distance threshold to 50 mm. 


We employ the algorithm from Section 3.1 to compute correspondences for
the CAESAR dataset. One inconsistency between the datasets is that the hands
in the MPI Human Shape dataset are closed, while they are open in the CAESAR
dataset. As a remedy, we set $\alpha$ and $\gamma$ to zero for hand vertices in
Eq. 7, thus only allowing $E_s$ to contribute. Prior to fitting, we sub-sample
each CAESAR scan so that its number of vertices exceeds the number of vertices
of $T$ by a factor of three (6449 vertices in $T$ vs. 19347 vertices in $S$).
This provides a good trade-off between fitting quality and computational
efficiency.


4.2. Quality measure 

Measuring the accuracy of surface fitting is not straightforward, as no ground
truth correspondence between $S$ and $T$ is available. We evaluate the fitting
accuracy by finding the nearest neighbor in $S$ for each fitted template vertex.
If this neighbor is not further than 50 mm from its correspondence in $T$ and
its face normals do not deviate more than 60°, the Euclidean vertex-to-vertex
distance is computed. In our experiments we report both the proportion of
vertices falling below a certain threshold and the distance per vertex averaged
over all fitted templates. In the following, we first show the effects of various
types of initialization and weighting schemes in the NRD procedure on the
fitting error. Second, we show the effect of performing multiple bootstrapping
rounds.

Figure 1: Fitting error on the CAESAR dataset when using the S-SCAPE space [18]
alone, NRD alone, and initializing using S-SCAPE prior to NRD with different
weighting schedules (S-SCAPE + NRD, S-SCAPE + NRD CW). Shown are (a) the total
fitting error, i.e. the proportion of vertices [%] with fitting error below a
threshold, and (b) the mean fitting error, i.e. the average fitting error per
vertex.
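
The two reported quantities can be computed from a matrix of per-vertex errors as sketched below. Marking vertices without a valid nearest neighbor as NaN and excluding them from both statistics is an assumption of this sketch; the text does not say how such vertices are treated.

```python
import numpy as np

def fitting_error_stats(errors_mm, thresholds_mm):
    """Summarize per-vertex fitting errors.

    errors_mm: (num_scans, N) vertex-to-vertex distances; NaN marks vertices
               whose nearest neighbor failed the distance or normal test.
    Returns the percentage of valid vertices below each threshold and the
    mean error per vertex over all scans.
    """
    valid = errors_mm[~np.isnan(errors_mm)]
    below = [100.0 * np.mean(valid < t) for t in thresholds_mm]
    per_vertex_mean = np.nanmean(errors_mm, axis=0)
    return below, per_vertex_mean

errors = np.array([[1.0, 4.0, np.nan],
                   [3.0, 12.0, 6.0]])
below, per_vertex_mean = fitting_error_stats(errors, thresholds_mm=[5.0, 10.0])
```

Evaluating the first quantity over a range of thresholds gives a cumulative curve of the kind described for panel (a), and the second gives the per-vertex map of panel (b).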


4.3. Initialization

First, we evaluate two different initialization strategies used in our fitting 
procedure. We compare the results when using an average human template 
(NRD) to the case when additionally using the S-SCAPE space learned on 
the MPI Human Shape dataset (S-SCAPE + NRD) for initialization. We also 
demonstrate the effects of both non-rigid deformation schemes on the fitting 
accuracy and finally compare to the results when using the publicly available
S-SCAPE space by Jain et al. [18] alone (S-SCAPE).

The results are shown in Fig. 1. The total fitting error in Fig. 1(a) shows
that NRD achieves good fitting results in the low error range of 0-10 mm, as it
can produce good template fits for the areas where $T$ is close to $S$. However,
as NRD is a model-free method, the smooth topology of $T$ may not be preserved
during the deformation, e.g. convex surfaces of $T$ may be deformed into
non-convex surfaces after NRD. This leads to large fitting errors for areas of
$T$ that are far from $S$. S-SCAPE + NRD uses a shape space fitting prior to
NRD, which allows for a better initial alignment of $T$ to $S$. Note that
S-SCAPE + NRD results in a better fitting accuracy in the high error range of
10-20 mm. The fitting result of S-SCAPE + NRD compares favorably against using
S-SCAPE alone. Although S-SCAPE results in deformations preserving the
human body shape topology, the shape space is learned from the relatively
specialized MPI Human Shape dataset containing mostly young adults and thus
cannot represent all shape variations.

We also analyze the differences in the mean fitting errors per vertex in
Fig. 1(b). NRD achieves good fitting results for most of the vertices. However,
the arms are not fitted well due to differences in body posture of $T$ and $S$.
Furthermore, the average fitting error is not smooth, which shows that despite 
using E s , NRD may produce non-smooth deformations. In contrast, the result 
of S-SCAPE + NRD is smoother and has a lower fitting error for the arms. 
Clearly, the average fitting error of S-SCAPE is much higher, with notably 
worse fitting results for arms, belly and chest. 

4.4. NRD parameters

Second, we evaluate the influence of the weight relaxation during NRD on the
fitting accuracy. Specifically, we compare the standard weighting scheme where
weights are relaxed in each iteration (S-SCAPE + NRD) to the case where the
weights stay constant (S-SCAPE + NRD CW). Fig. 1(a) shows that the total
fitting error of S-SCAPE + NRD is lower than that of S-SCAPE + NRD CW. This
is because S-SCAPE + NRD CW enforces higher localized rigidity by keeping the
weights constantly high, while S-SCAPE + NRD relaxes the weights so that
$T$ can fit more accurately to $S$. This explanation is supported by consistently
higher per-vertex mean fitting errors in case of S-SCAPE + NRD CW compared
to S-SCAPE + NRD, as shown in Fig. 1(b). The highest differences are in the
areas of high body shape variability, such as belly and chest. Different weight
reduction schemes such as $\beta := 0.5\beta$, $\gamma := 0.5\gamma$ and
$\beta := 0.25\beta$, $\gamma := 0.25\gamma$ lead to better fitting accuracy
compared to constant weights, with the latter scheme achieving slightly better
results and faster convergence rates. We thus use the proposed weight reduction
scheme in the following.

Figure 2: Fitting error after up to four bootstrapping rounds over the CAESAR
database when using the S-SCAPE space [18] as initialization for iteration 0.
Shown are (a) the total fitting error, i.e. the proportion of vertices [%] with
fitting error below a threshold, and (b) the mean fitting error per vertex on a
sample scan for iterations 0, 2 and 4.


4.5. Bootstrapping

Third, we evaluate the fitting accuracy before and after performing multiple
rounds of bootstrapping. To that end, we use the output of S-SCAPE
+ NRD (iteration 0) to learn a new statistical shape space, which is in turn
used to initialize NRD during the second pass over the data (iteration 1). This
process is repeated for five passes. The number of registered scans that passed
the visual inspection after each round is 1771, 3253, 3641, 4237 and 4301,
respectively. These results show that bootstrapping makes it possible to
register and thus to learn from an increasing number of scans. Fitting results
are shown in Fig. 2. The close-up shows that although the overall fitting
accuracy before and after bootstrapping is similar, bootstrapping slightly
improves the fitting accuracy in the range of 10-30 mm. Fitting results after
three passes over the dataset (iteration 2) are slightly better compared to the
initial fitting (iteration 0), and the accuracy is further increased after five
passes (iteration 4). Fig. 2(b) shows sample fitting results before and after
several bootstrapping rounds. The largest improvements are achieved for the
belly and chest, areas with large shape variability. The fitting improves with
an increasing number of bootstrapping rounds. We use the fitting results after
five passes (iteration 4) to learn the S-SCAPE space used in the following.
5. Evaluation of statistical shape space


In this section, we evaluate the S-SCAPE space using the statistical quality
measures of generalization and specificity.

5.1. Quality measure 

We use two complementary measures of shape statistics. Generalization 
evaluates the ability of a shape space to represent unseen instances of the object 
class. Good generalization means the shape space is capable of learning the
characteristics of an object class from a limited number of training samples;
poor generalization indicates overfitting to the training set. Generalization is
measured using leave-one-out cross reconstruction of training samples, i.e. the
shape space is learned using all but one training sample and the resulting shape 
space is fitted to the excluded sample. The fitting error is measured using the 
mean vertex-to-vertex Euclidean distance. Generalization is reported as mean 
fitting error averaged over the complete set of trials, and plotted as a function 
of the number of shape space parameters. It is expected that the mean error 
decreases until convergence as the number of shape space parameters increases. 
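
The leave-one-out protocol can be sketched with a plain PCA space. Here the "fit" of the space to the excluded sample is taken to be an orthogonal projection onto the leading components, which is an assumption of this sketch; the toy data is constructed to lie exactly in a three-dimensional subspace so that the effect of the number of parameters is visible.

```python
import numpy as np

def generalization_error(data, num_components):
    """Leave-one-out reconstruction error of a PCA shape space.

    data: (num_samples, N, 3) registered meshes in correspondence.
    Returns the mean vertex-to-vertex distance averaged over all trials.
    """
    flat = data.reshape(len(data), -1)
    errors = []
    for i in range(len(flat)):
        train = np.delete(flat, i, axis=0)
        mean = train.mean(axis=0)
        _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
        basis = vt[:num_components]
        # Project the held-out sample onto the space and reconstruct it.
        recon = mean + (flat[i] - mean) @ basis.T @ basis
        dist = np.linalg.norm((recon - flat[i]).reshape(-1, 3), axis=1)
        errors.append(dist.mean())
    return float(np.mean(errors))

rng = np.random.default_rng(0)
# Toy data that lies exactly in a 3-dimensional linear subspace.
toy = (rng.normal(size=(12, 3)) @ rng.normal(size=(3, 90))).reshape(12, 30, 3)
curve = [generalization_error(toy, d) for d in (1, 2, 3)]
```

With three components the held-out toy sample is reconstructed up to numerical precision, while a single component leaves a clearly nonzero error.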

Specificity measures the ability of a shape space to generate instances of the 
object class that are similar to the training samples. The specificity test is
performed by generating a set of instances randomly drawn from the learned shape
space and by comparing them to the training samples. The error is measured
as the average distance of the generated instances to their nearest neighbors in
the training set. It is expected that the mean distance increases until
convergence with an increasing number of shape space parameters. We follow
Styner et al. and generate 10,000 random samples.
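
A specificity estimate can be sketched along the same lines. The sampling distribution is not specified in the text; this sketch assumes independent Gaussian coefficients whose standard deviations match the training data along each principal direction, and it measures distances as mean vertex-to-vertex Euclidean distance.

```python
import numpy as np

def specificity_error(data, num_components, num_samples=10_000, seed=0):
    """Mean distance of random shape-space samples to the nearest training mesh.

    data: (num_training, N, 3) registered meshes in correspondence.
    """
    rng = np.random.default_rng(seed)
    flat = data.reshape(len(data), -1)
    mean = flat.mean(axis=0)
    _, s, vt = np.linalg.svd(flat - mean, full_matrices=False)
    basis = vt[:num_components]
    # Standard deviation of the training data along each principal direction.
    std = s[:num_components] / np.sqrt(len(flat) - 1)
    coeffs = rng.normal(size=(num_samples, num_components)) * std
    samples = mean + coeffs @ basis
    errors = []
    for sample in samples:
        # Mean vertex-to-vertex distance from the sample to every training mesh.
        per_vertex = np.linalg.norm(
            (flat - sample).reshape(len(flat), -1, 3), axis=2)
        errors.append(per_vertex.mean(axis=1).min())
    return float(np.mean(errors))

rng = np.random.default_rng(1)
toy = rng.normal(size=(15, 20, 3))
value = specificity_error(toy, num_components=3, num_samples=200)
```

The nearest-neighbor search is brute force, which is adequate for toy data but would need batching or an index for several thousand training meshes.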

5.2. Bootstrapping 

We evaluate the influence of bootstrapping on the quality of the statistical
shape space by comparing models obtained after zero, one, two and four
iterations of bootstrapping. The geometry of the training samples changes in
each bootstrapping round, which makes the generalization and specificity results
incomparable across different shape spaces. We thus use the training samples
obtained after four iterations of bootstrapping as "ground truth", i.e., the
reconstruction error of generalization and the nearest neighbor distance of
specificity for each shape space are computed w.r.t. the fitting results after
four bootstrapping rounds. This allows for a fair comparison across different
statistical shape spaces.

Figure 3: Influence of different design choices on statistical quality measures.
Shown are the influence of (a) bootstrapping, (b) number of training samples and
(c) posture normalization on generalization (top row) and specificity (bottom
row). Best viewed with zoom on the screen.

The results are shown in Fig. 3(a). Generalization error is already low after 
a single iteration of bootstrapping because after one iteration, the shape space 
is learned from a significantly larger number of training samples, thereby using 
samples with higher shape variation that were discarded in the 0th iteration. 
The following rounds of bootstrapping have little influence on generalization 
and specificity, with the shape space after four iterations resulting in a slightly 
lower specificity error than for previous iterations for a small number of shape 
parameters. 
5.3. Number of training samples 

To evaluate the influence of the number of training samples, we vary the 
number of samples obtained after four bootstrapping iterations. Specifically, 
we consider subsets of 50, 100, 1,000 and 4,307 (all − 1) training samples. 
To compute a shape space, the desired number of training shapes are sampled 
from the entire set of training samples. For generalization, we cross-evaluate 
on all 4,308 training samples by leaving one sample out and by sampling the 
desired number of training shapes from the remaining samples. For specificity, 
we compute the nearest-neighbor distances to all 4,308 training samples to find 
the closest sample. 

The results are shown in Fig. 3(b). The shape space learned from the smallest 
number of samples performs worst. Increasing the number of samples consistently 
improves the performance, with the best results achieved when using the 
maximum number. The reduction in both generalization and specificity error is 
most pronounced when increasing the number of samples from 50 to 100. Further 
increasing the number of samples to 1,000 affects specificity much more strongly 
than generalization. This shows that the shape space learned from only 100 
samples generalizes well, while its generative qualities are poor. Increasing the 
number of samples from 1,000 to 4,307 only slightly reduces both generalization 
and specificity errors, which shows that a high-quality statistical shape space 
can be learned from 1,000 samples. 

5.4. Posture normalization 

Finally, we evaluate the generalization and specificity of the shape space 
obtained when performing posture normalization using the methods of Wuhrer et al. 
[38] (WSX) and Neophytou and Hilton [21] (NH). The results are shown in 
Fig. 3(c). Posture normalization significantly improves generalization and 
specificity, with WSX achieving the best result. The reduction of the average 
fitting error in case of generalization is highest for a low number of shape 
parameters. This is because both WSX and NH lead to shape spaces that are 
more compact compared to the shape space obtained with unnormalized training 
shapes. Additionally, both posture-normalized shape spaces exhibit much 
better specificity. Compared to the shape space trained before posture 
normalization, randomly generated samples from both shape spaces trained after 
WSX and NH exhibit less variation in posture and are thus more similar to 
their corresponding posture-normalized training samples. 

Finally, we qualitatively examine the first five PCA components learned 
by the following S-SCAPE spaces: the current state-of-the-art shape space 
S-SCAPE [18], our shape space without posture normalization and with posture 
normalization using WSX and NH. The results are shown in Fig. 4. Major 
modes of shape variation by S-SCAPE (row 1) are affected by global and local 
posture-related deformations, such as moving of arms or tilting the body. In 
contrast, the principal modes of variation by our shape space (row 2) are mostly 
due to shape changes, which is achieved by a better template fitting procedure 
and a more representative training set. However, small posture variations 
are still part of the learned shape space. Performing posture normalization of 
the training samples prior to learning the shape space completely factors out 
changes due to posture, as can be seen in the major principal components of 
ours+WSX (row 3) and ours+NH (row 4). 
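
For illustration, shapes at ±3 standard deviations along a principal component can be generated as in the sketch below. It assumes a plain PCA over registered meshes with identical vertex order; the function name `pca_mode_extremes` is hypothetical and the sketch is not the visualization code used for the figure.

```python
import numpy as np


def pca_mode_extremes(shapes, mode, n_sigma=3.0):
    """Return the mean shape displaced by +/- n_sigma std. devs. along one PCA mode.

    shapes has shape (N, V, 3); mode is the zero-based index of the component.
    """
    n, v, _ = shapes.shape
    flat = shapes.reshape(n, -1)
    mean = flat.mean(axis=0)
    _, s, vt = np.linalg.svd(flat - mean, full_matrices=False)
    # Standard deviation of the training data along the chosen mode.
    sigma = s[mode] / np.sqrt(n - 1)
    offset = n_sigma * sigma * vt[mode]
    return (mean + offset).reshape(v, 3), (mean - offset).reshape(v, 3)
```

Rendering the two returned meshes for modes 0 to 4 would give one row of such a visualization.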

6. Human body reconstruction 

Finally, we evaluate our improved S-SCAPE spaces on the task of estimating 
human body shape from sparse and noisy visual input. We follow the approach 
by Helten et al. to estimate the body shape of a person from two sequentially 
taken front and back depth images. First, body shape and posture are fitted 
independently to each depth image. Second, the obtained results are used as 
initialization of a method that jointly optimizes over shape and independently 
optimizes over posture parameters. This optimization strategy is used because 
the shape in both depth scans is of the same person, but the pose may differ. 
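
The idea of sharing shape parameters across both views while keeping posture parameters separate can be illustrated with a toy example. The sketch below assumes a purely linear model in which an observation is the mean plus a shape term and a pose term; this is a strong simplification, since pose in S-SCAPE is not a linear offset, and it is not the method of Helten et al. Because the toy problem is linear, it needs no initialization from per-view fits. All names are hypothetical.

```python
import numpy as np


def fit_shared_shape(obs_front, obs_back, mean, shape_basis, pose_basis):
    """Solve for one shape vector shared by both views and one pose vector per view.

    Toy linear model (an assumption):
        obs = mean + shape_basis @ beta + pose_basis @ theta
    """
    d, k = shape_basis.shape
    p = pose_basis.shape[1]
    zeros = np.zeros((d, p))
    # Block system: shape columns are shared, pose columns are per view.
    a = np.block([[shape_basis, pose_basis, zeros],
                  [shape_basis, zeros, pose_basis]])
    b = np.concatenate([obs_front - mean, obs_back - mean])
    x, *_ = np.linalg.lstsq(a, b, rcond=None)
    # Split the solution into shared shape and the two pose vectors.
    return x[:k], x[k:k + p], x[k + p:]
```

Stacking both views into one least-squares system forces a single shape estimate to explain both observations, while each view keeps its own pose unknowns.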
Figure 4: Visualization of the first five PCA eigenvectors scaled by ±3σ (standard deviation). 
Shown are eigenvectors of the S-SCAPE space [18] (row 1) and the S-SCAPE spaces trained 
using our pre-processed data without (row 2) and with posture normalization using WSX [38] 
(row 3) and NH [21] (row 4). 
6.1. Dataset and experimental setup 

We use a publicly available dataset containing Kinect body scans of three 
males and three females. Examples of the Kinect scans are shown in Fig. 6(a). 
For each subject, a high-resolution laser scan was captured to determine “ground 
truth” body shape. Following the evaluation protocol of Helten et al., we 
first fit a shape space to the depth data, then fit the shape space to the ground 
truth scan, and finally compute the fitting error as the vertex-to-vertex 
Euclidean distance between the vertices of the depth-fitted mesh and the 
ground-truth-fitted mesh. As the required landmarks are not available for this 
dataset, we manually placed 14 landmarks on each depth and laser scan. 

6.2. Quantitative evaluation 

For quantitative evaluation, we compare the following four shape models 
presented above: the current state-of-the-art shape space [18], our shape space 
without posture normalization and with posture normalization using WSX and 
NH. In our experiments, we vary the number of shape space parameters and 
the number of training samples. To evaluate the fitting accuracy, we report the 
proportion of vertices whose fitting error falls below a given threshold. 
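
The reported accuracy measure can be computed as in the following sketch, assuming the depth-fitted and ground-truth-fitted meshes share the same vertex order so that errors are taken between corresponding vertices. The function names are hypothetical, and the thresholds in the usage note are arbitrary examples.

```python
import numpy as np


def per_vertex_error(fitted, reference):
    """Euclidean distance between corresponding vertices of two (V, 3) meshes."""
    return np.linalg.norm(fitted - reference, axis=1)


def proportion_below(errors, thresholds):
    """Percentage of vertices whose error is strictly below each threshold."""
    errors = np.asarray(errors, dtype=float)
    thresholds = np.asarray(thresholds, dtype=float)
    # Compare every error with every threshold, then average over vertices.
    return 100.0 * (errors[:, None] < thresholds[None, :]).mean(axis=0)
```

For example, `proportion_below(per_vertex_error(a, b), [5, 10, 20])` returns one percentage per threshold, which can be plotted as a curve over the threshold values.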

The results are shown in Fig. 5, where the number of shape space parameters 
varies in the columns and the number of training samples varies in the rows. In 
all cases our S-SCAPE spaces learned from the CAESAR dataset significantly 
outperform the shape space by Jain et al., which is learned from the far less 
representative MPI Human Shape dataset. Our models achieve good fitting 
accuracy when using as few as 20 shape parameters, and the performance stays 
stable when increasing the number of shape parameters up to 50 (first row). In 
contrast, the performance of the shape space by Jain et al. drops, possibly due 
to overfitting to unrealistic shape deformations in noisy depth data. 
Interestingly, better performance by our models is evident even in the case when all 
models are learned from the same number of training samples (third and fourth 
rows). This shows that the CAESAR data has higher shape variability than the 
MPI Human Shape data. In the majority of cases, the shape space learned from 
the posture-normalized samples with NH outperforms the shape space learned 
from samples without posture normalization. This shows that the posture 
normalization method of Neophytou and Hilton [21] helps to improve the accuracy 
of fitting to noisy depth data. Surprisingly, the shape space learned from 
samples without posture normalization outperforms the shape space learned from 
the posture-normalized samples with WSX in most cases. Overall, the 
quantitative results show the advantages of our approach of building S-SCAPE spaces 
learned from a large representative set of training samples with additional 
posture normalization. 

Figure 5: Fitting error on the dataset of depth scans of S-SCAPE spaces by Jain et 
al. [18] and S-SCAPE spaces trained using our processed data without and with posture 
normalization using WSX and NH. Shown is the proportion of vertices [%] for which the 
fitting error falls below a threshold. 
Figure 6: Per-vertex shape fitting error (mm) of multiple methods on sample individuals from 
the dataset of Helten et al. 


6.3. Qualitative evaluation 

To qualitatively evaluate the fitting, we visualize the per-vertex fitting errors. 
We consider the S-SCAPE spaces learned from all available training samples and 
use 20 shape space parameters. For visualization we choose two subjects, male 
and female, where the differences among the shape spaces are most pronounced. 

Results are shown in Fig. 6. Our shape spaces better fit the data, in 
particular in the areas of belly and chest. This is to be expected, as we learn 
from the larger and more representative CAESAR dataset. Both shape spaces 
trained from posture-normalized models can better fit the arms compared to 
non-normalized models. 


7. Conclusion 

In this work we address the challenging problem of building an efficient and 
expressive 3D body shape space from the largest commercially available 3D body 
scan dataset [26]. We carefully design and evaluate different data preprocessing 
steps required to obtain high-quality body shape models. To that end we 
evaluate different template fitting procedures. We observe that shape and posture 
fitting of an initial shape space to a scan prior to non-rigid deformation 
considerably improves the fitting results. Our findings indicate that multiple passes 
over the dataset improve initialization and thus increase the overall fitting 
accuracy and statistical shape space qualities. Furthermore, we show that posture 
normalization prior to learning a shape space leads to significantly better 
generalization and specificity of the S-SCAPE spaces. Finally, we demonstrate the 
advantages of our learned shape spaces over the state-of-the-art shape space of 
Jain et al. [18], learned on the largest publicly available dataset, on the task of 
human body shape reconstruction from noisy depth data. 

We release our S-SCAPE spaces, registered CAESAR scans, raw scan 
preprocessing code, code to fit an S-SCAPE space to a raw scan and evaluation code 
for public usage. We believe this contribution is required for future 
development in human body modeling. 